Abstract

Context.  Writing  software  for  the  current  generation  of  parallel  systems  requires 
significant  programmer  effort,  and  the  community  is  seeking  alternatives  that  reduce 
effort  while  still  achieving  good  performance. 

Objective. Measure the effect of parallel programming models (message-passing
vs. PRAM-like) on programmer effort.

Design, Setting, and Subjects. One group of subjects implemented sparse-matrix
dense-vector multiplication using message-passing (MPI), and a second group solved
the  same  problem  using  a  PRAM-like  model  (XMTC).  The  subjects  were  students 
in  two  graduate-level  classes:  one  class  was  taught  MPI  and  the  other  was  taught 
XMTC. 

Main  Outcome  Measures.  Development  time,  program  correctness. 

Results. Mean XMTC development time was 4.8 hours less than mean MPI
development time (95% confidence interval, 2.0-7.7), a 46% reduction. XMTC programs
were more likely to be correct, but the difference in correctness rates was not
statistically significant (p=.16).

Conclusions.  XMTC  solutions  for  this  particular  problem  required  less  effort  than 
MPI  equivalents,  but  further  studies  are  necessary  which  examine  different  types  of 
problems  and  different  levels  of  programmer  experience. 
1 Introduction


While desktop computers today are very powerful, there remain many
computational tasks of interest that conventional computers cannot complete in
a reasonable time. Such tasks are especially common in the domain of
computational science, where physical phenomena (e.g., nuclear reactions,
earthquakes, planetary weather and climate) are studied through computer
simulation. For these problems, scientists must turn to high-performance computing
(HPC) systems. These systems are able to provide more processing power
than conventional systems through parallelism: by connecting many processing
units together in parallel, such HPC systems are able to obtain much
greater performance, at least in principle. In practice, it can be difficult to
achieve performance gains on HPC systems because of the complexities
involved in implementing efficient parallel programs. While the challenges of
parallel programming have traditionally been a concern for the HPC
community alone, the rise of multicore architectures is making the parallel
programming challenge increasingly relevant to all programmers [38].


Programmers must specify parallelism explicitly in their source code to take
advantage of HPC machines. It is through the parallel programming model that
the programmer specifies how the different processes in a parallel program
coordinate to complete a task. Researchers have proposed many such models,
with corresponding implementations as libraries, extensions of sequential
languages (e.g. C, Fortran), and new parallel languages. These models
include: message-passing [16,37], threaded [15,31,28,34], partitioned global
address space (PGAS) [9,32,43], data-parallel [5,11], dataflow [17], bulk
synchronous parallel (BSP) [22], tuple space [27] and parallel random access
memory (PRAM) [41,26].


The  pilot  study  in  this  paper  addresses  the  following  research  question:  would  a 
PRAM-like  system  offer  measurable  benefits  over  alternative  parallel  systems? 
We  conducted  a  study  in  an  academic  setting  to  compare  the  time  required  to 
solve  a  particular  programming  problem  using  the  XMTC[2]  extensions  to  the 
C  language  (which  supports  a  PRAM-like  model)  versus  using  the  MPI[16] 
library  (which  supports  a  message-passing  model). 


basili@cs.umd.edu  (Victor  R.  Basili),  vishkin@umd.edu  (Uzi  Vishkin), 
gilbert@cs.ucsb.edu  (John  Gilbert). 
Fig. 1. Message-passing model of a parallel computer (processing elements and
memory banks connected by a network backplane)


1.1 Message-passing with MPI


In the message-passing model, the parallel machine is modeled as a set of
processing elements that each have their own bank of addressable local memory.
The  processing  elements  are  connected  to  each  other  over  a  network.  Figure  1 
depicts  this  model:  boxes  labeled  P  are  processing  elements  and  boxes  labeled 
M  are  memory  banks.  Processing  elements  coordinate  to  complete  tasks  by 
exchanging  messages  over  the  network. 

The  MPI  library  is  one  implementation  of  the  message-passing  model  with 
bindings  to  languages  such  as  Fortran,  C  and  C++.  When  an  MPI  program 
runs, a fixed number of processes are launched on the parallel machine, where
each  process  is  typically  assigned  to  a  separate  processor.  Each  process  has 
a  unique  ID,  which  can  be  retrieved  with  a  function  call.  Programmers  use 
send  and  receive  function  calls  to  communicate  among  the  different  processes. 
There  are  six  basic  function  calls  in  MPI: 

• MPI_Init - initialize MPI environment (called at beginning of program)

• MPI_Finalize - clean up MPI environment (called at end of program)

• MPI_Comm_size - returns total number of processes

• MPI_Comm_rank - returns ID of the current process

• MPI_Send - send a message to another process

• MPI_Recv - receive a message from another process

While these six calls are sufficient to implement any message-passing program
in MPI, many other functions are provided for convenience, some of which may
provide better performance than the basic send/receive calls. They include
different types of send/receive calls (buffered vs. unbuffered, blocking vs.
non-blocking), multipoint communications (e.g. broadcast, scatter, gather),
reduction operations (e.g. sum, product, maximum), barrier operations, and timing
functions for performance analysis.


Listing 1. MPI code

#include <mpi.h>
#include <stdio.h>
#define N 3

int main(int argc, char *argv[]) {
    int my_id, num_procs;
    int data[N];

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    printf("Hello from process %d of %d\n",
           my_id, num_procs);

    /* Send data from process 0 to process 1 */
    if (my_id == 0) {
        data[0] = 1; data[1] = 3; data[2] = 5;
        MPI_Send(data, N, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (my_id == 1) {
        MPI_Recv(data, N, MPI_INT, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}


Listing  1  shows  an  example  of  a  simple  MPI  program  that  prints  out  the 
process  ID  of  each  process  and  then  sends  an  array  of  integers  from  process  0 
to  process  1. 

The  great  strength  of  the  MPI  model  is  that  it  maps  well  to  a  broad  range  of 
parallel  systems  in  use  today.  While  there  are  some  shared  memory  systems 
where  the  time  to  access  any  memory  address  is  the  same  for  all  processors, 
most  HPC  systems  are  either  distributed  shared  memory  machines  (where 
processors  can  directly  access  all  memory,  but  some  accesses  are  faster  than 
others), clusters (where processors have their own local memory and are
connected together over a local area network) or hybrids. Accessing local processor
memory is typically much faster than accessing remote memory or
communicating over the network. Because MPI gives the programmer low-level control
of communication, it allows programmers to exploit locality: they can write
programs that minimize communication overhead, thereby avoiding costly
remote memory accesses or network communications. Because of its versatility,
MPI is currently the most widely used parallel programming method on HPC
systems.

While  MPI  is  the  most  popular  parallel  programming  technology  in  terms  of 
number  of  users,  it  is  not  well-liked.  MPI  is  considered  difficult  to  program 
compared to serial programming. In particular, MPI forces programmers to
work at a very low level of abstraction to deal with many of the communication
details.  Several  reports  commissioned  by  the  U.S.  government  have  pointed  out 
the  challenges  of  programming  today’s  parallel  systems  with  MPI  [4,19,21,25]. 


1.2  PRAM-like  with  XMTC 


The  PRAM  model  [18]  is  a  generalization  of  the  Random  Access  Machine 
(RAM)  model,  the  basic  sequential  computing  model  exposed  to  programmers 
in traditional programming languages. Figure 2 depicts this model. Although
the figure shows only a fixed number of processors, PRAM theory permits an
algorithm to assume an unbounded collection of RAM processors, since such an
algorithm can readily be translated to run on a fixed number. The
memory can likewise be assumed to have an unbounded collection of memory cells,
which are accessible to all processors in unit time. The main difference between
a  sequential  program  and  a  parallel  program  using  the  PRAM  model  is  the 
existence  of  parallel  for  loops,  where  each  iteration  of  the  loop  is  executed  in 
parallel  on  a  separate  processor.  This  is  typically  referred  to  as  a  pardo-loop, 
short  for  parallel  do. 

When executing parallel loops, all processors execute the loop instructions
synchronously. The synchronous execution of the processors distinguishes PRAM
from other shared memory models such as POSIX threads [31] or OpenMP [15],
and avoids most problems associated with race conditions. Any reference to the
PRAM model is usually accompanied by assumptions about the outcome of
concurrent accesses to the same memory location. The arbitrary concurrent-read
concurrent-write (CRCW) convention allows concurrent reads of the same
memory location; when multiple processors attempt to write to the same memory
location simultaneously, one of the attempted writes will succeed, but
it is not known in advance which one.

XMTC is an extension of the C programming language that adds parallel
directives to provide a PRAM-like model to the programmer. The main addition
is the spawn directive, which provides support for a PRAM-like pardo loop.
The directive will spawn multiple virtual threads and execute the ensuing code
block in parallel. Within these parallel blocks, each thread is assigned an ID
which can be accessed using the $ symbol. There is also the sspawn directive,
which can be nested, for launching a single additional thread.
Fig. 2. PRAM model of a parallel computer


Listing 2. XMTC code

#include <xmtc.h>
#include <xmtio.h>
#define N 20

int main() {
    int i;
    int A[N], B[N], C[N];

    /* initialize A,B arrays */

    spawn(0, N-1) {
        if ($ % 2 == 0) {
            C[$] = A[$] + B[$];
        } else {
            C[$] = A[$] - B[$];
        }
    }

    for (i = 0; i < N; i++) {
        printf("%d ", C[i]);
    }
}

XMTC  implements  a  CRCW  PRAM:  if  multiple  threads  attempt  to  update 
the  same  memory  location  simultaneously,  then  an  arbitrary  one  will  succeed. 
XMTC  also  provides  prefix-sum  directives  that  implement  concurrent  writes, 
among  other  things. 

Listing  2  shows  XMTC  code  where  the  even  elements  of  arrays  B  and  A  are 
added,  and  the  odd  elements  of  B  are  subtracted  from  odd  elements  of  A. 

The Parallel Random Access Machine (PRAM) model [18] has long been
advocated as a model for designing parallel algorithms [24]. Historically, PRAM
has been criticized because it assumes that processors run synchronously and
interprocessor communication is free [14], assumptions which are severely
violated on currently available HPC systems. It is not possible to program
current systems using the PRAM model because modern architectures are not
designed to support such a model efficiently. However, because of the recent
trends in semiconductor technology towards multicore processors [38], it may
soon be feasible to build large-scale, fine-grained uniform-memory-access
parallel machines which could be modeled as PRAMs. For example, the XMT
project at the University of Maryland is conducting research on how to build
chips [42,1] that could efficiently support programs written using a PRAM-like
model [41].


1.3  Context  of  the  study 


This study is one of a series of studies being carried out as part of the DARPA
High Productivity Computing Systems (HPCS) project (footnote 1), which is
investigating alternative parallel programming models to determine their impact on
productivity  relative  to  existing  models  such  as  MPI.  All  of  these  studies, 
including  the  one  described  in  this  paper,  have  been  carried  out  by  software 
engineering  researchers  (the  first  two  authors  of  the  paper).  These  researchers 
had  full  control  over  the  results  reported  in  this  study  (with  the  exception  of 
XMTC  performance  analysis  in  Section  4.4)  and  have  no  vested  interest  in 
any  one  particular  programming  model  or  technology. 


2  Related  work 


Several empirical studies have previously been done to evaluate the impact
of parallel programming technologies on productivity, although we found no
previous studies that focused specifically on PRAM. Szafron and Schaeffer ran
a study to evaluate the usability of a parallel programming environment
compared to a message-passing library [39]. Browne et al. studied the effect of
a parallel programming environment on defect rates [6]. Rodman and
Brorsson [35] evaluated performance-effort trade-offs in porting a shared-memory
program to use a hybrid shared-memory/message-passing model.
Additionally, some studies have been done to evaluate the effect of a parallel language
on effort by analyzing source code metrics [8,10,40].


1 http://www.highproductivity.org
3 Description of the Study


This  section  first  presents  the  goals  and  hypotheses,  then  gives  a  description 
of  the  study. 


3.1 Goals


Stated  in  GQM  form  [3],  the  goal  of  this  study  is  to  analyze  message-passing 
and  PRAM-like  parallel  programming  models  for  the  purpose  of  evaluation 
with  respect  to: 

•  program  correctness 

•  development  time 

from  the  viewpoint  of  the  researcher  in  the  context  of 

•  graduate-level  parallel  programming  classes 

•  solving  small  programming  problems 

Note  that  we  use  the  term  development  time  to  refer  to  the  time  that  the 
subjects  spend  implementing  a  solution  to  the  programming  problem.  In  the 
software  engineering  literature,  this  is  sometimes  referred  to  as  effort  [33],  and 
we  use  the  terms  interchangeably. 


3.2  Hypotheses 


Proponents of the PRAM model claim that it is much simpler than the
message-passing model for implementing parallel algorithms, since
programmers do not have to deal with issues such as domain decomposition and explicit
communication between processes. We use program correctness and
development time as outcome variables to measure ease of use.

Based  on  the  above,  we  consider  the  following  two  hypotheses  in  our  study. 

• H1: Programs written in XMTC are more likely to be correct than programs
written in MPI.

• H2: Writing XMTC programs requires less development time than writing MPI
programs.


3.3  Study  Design 


To conduct this study, we leveraged existing graduate-level parallel
programming courses at two different universities. In one class, the students were given
a parallel programming assignment to implement in MPI, and in the other
class, the students were given the same parallel programming assignment in
XMTC.

This study design is a nonequivalent control group design [7], which is
technically a quasi-experiment since subjects were not randomly assigned to
treatment groups.


3.4 Subjects and Groups


The  subjects  were  students  in  graduate-level  parallel  programming  courses 
at  the  University  of  California,  Santa  Barbara  (UCSB)  and  the  University  of 
Maryland  (UMD).  The  focus  of  the  UCSB  class  was  on  developing  parallel 
programs  to  run  on  the  current  generation  of  architectures,  and  the  course 
covered MPI as well as other models (OpenMP [15], Matlab*P [12]). The
focus of the UMD class was parallel algorithms in the PRAM model, and the
students solved parallel programming problems in XMTC.

The  students  were  assigned  to  treatment  groups  by  class.  UCSB  students 
were  given  a  problem  to  solve  using  MPI,  and  UMD  students  were  given  the 
identical  problem  to  solve  using  XMTC.  UCSB  students  could  choose  to  solve 
the  problem  in  either  C/C++  or  Fortran. 


3.5 Procedure


The  students  in  each  class  were  given  a  parallel  programming  assignment 
which  they  were  required  to  complete  as  part  of  their  course.  This  assignment 
was  one  of  several  assignments  in  the  classes.  Students  were  not  required  to 
participate  in  the  study. 

Subjects were given a description of the task to be completed, as well as a
C header file which contained some data structures necessary to complete
the assignment. They were given a deadline of approximately two weeks, and
worked on the assignment in their own time. In each class, the students had
login accounts on a machine which they were to use for compiling and running
their code.

The professors who taught the courses did not have a direct role in either
carrying out the study or in analyzing the data.


3.6 Study Task


The  task  was  to  write  a  function  to  multiply  a  sparse  matrix  with  a  dense 
vector.  The  subjects  were  provided  with  the  data  structures  for  representing 
the  sparse  matrix,  and  these  data  structures  were  identical  for  the  MPI  and 
XMTC  groups.  The  professors  tried  to  ensure  that  the  students  were  exposed 
to  the  same  type  of  information  about  the  problem. 
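
For readers unfamiliar with the kernel, the computation the subjects were asked to parallelize can be sketched serially as follows. This is only an illustrative sketch, not the code distributed in the study: the compressed sparse row (CSR) layout and the names csr_matrix and spmv are assumptions, since the header file given to the subjects is not reproduced in this paper.

```c
#include <stddef.h>

/* Hypothetical CSR (compressed sparse row) layout; the data structures
 * actually given to the subjects are not shown in this paper. */
typedef struct {
    size_t n_rows;         /* number of matrix rows */
    const size_t *row_ptr; /* n_rows + 1 offsets into col_idx and values */
    const size_t *col_idx; /* column index of each stored non-zero */
    const double *values;  /* value of each stored non-zero */
} csr_matrix;

/* Serial reference: y = A * x, with x dense and y of length n_rows. */
void spmv(const csr_matrix *a, const double *x, double *y)
{
    for (size_t i = 0; i < a->n_rows; i++) {
        double sum = 0.0;
        /* Only the stored non-zeros of row i contribute to y[i]. */
        for (size_t k = a->row_ptr[i]; k < a->row_ptr[i + 1]; k++) {
            sum += a->values[k] * x[a->col_idx[k]];
        }
        y[i] = sum;
    }
}
```

Under this assumed layout each row's dot product is independent of the others, so the outer loop is the natural unit of parallel work. In a PRAM-like model each iteration could become a thread, whereas in a message-passing model the rows would have to be partitioned among processes and the vectors communicated explicitly.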


3.7 Apparatus


Development time data was collected using two different methods: self-reported
and automatically collected. Subjects kept track of their development time
with a self-reported time log. In the XMTC group, subjects used a web-based
form to enter their development time data, and in the MPI group, subjects
used paper forms. We also collected automatic development time data by
instrumenting the compilers. These instrumented compilers recorded a set of
data (including timestamps) at each compile. In both groups, subjects were
required to compile and run their code on a remote machine. In the case of
MPI, this was a departmental Linux cluster. In the case of XMTC, this was the
prototype compiler and simulator software that was available on the class server.
From these two sources of data, we were able to come up with three
separate estimates of development time: one based entirely on self-reported data,
one based entirely on data from the instrumented compiler, and one based
on a combination of the two approaches (more details about our algorithm
for estimating development time based on the instrumented compiles and on
combining the development time measures can be found in [23]).

Performance data for the MPI programs was measured by running the
programs on a parallel machine and calculating the time spent doing
matrix-multiply using the MPI timing functions. Performance data for the XMTC
programs was measured using clock-cycle counts from the simulator.

Background  information  was  collected  from  the  subjects  using  on-line  and 
paper-based  questionnaires. 
Table 1
Subject participation

              Total  Consented  Completed
UCSB (MPI)       26         21         16
UMD (XMTC)       16         15         14

Table 2
Subject major

        CS  CE  EE  ME  CS/M  Mgmt  ?
MPI     13   0   0   2     0     0  1
XMTC     4   4   1   1     1     1  2

4  Data  analysis 


This section presents a data analysis of the results of the studies. We use a
significance level of .05 in all statistical tests (or, equivalently, a confidence
level of 95%). All statistical tests were performed using the R software
environment (version 2.0; see footnote 2). Power analyses were performed using
Lenth's Java applets for power and size [29].


4.1 Characterization of groups


Table  1  shows  the  number  of  students  in  each  class,  the  number  of  students  who 
consented  to  participate  in  the  study,  and  the  number  of  consenting  students 
who  completed  the  assignment  and  submitted  a  solution  (the  other  students 
dropped  the  class).  Table  2  shows  a  breakdown  of  the  subjects  by  major  (CS: 
computer  science,  CE:  computer  engineering,  EE:  electrical  engineering,  ME: 
mechanical engineering, CS/M: computer science & math, Mgmt: management
science,  ?:  did  not  specify  background). 


4.2 Correctness

• H1: Programs written in XMTC are more likely to be correct than programs
written in MPI.

The  programs  were  checked  for  correctness  by  running  them  against  a  known 
input  and  checking  if  the  program  output  matched  the  expected  output.  A 
program  that  crashed  during  execution  was  counted  as  being  incorrect.  Table 
3  provides  a  summary  of  the  correctness  across  classes. 

2  http://www.r-project.org 
Table 3
Correctness

Model    Number of correct submissions
MPI      7/13 (54%)
XMTC-1   12/14 (86%)
XMTC-2   11/14 (79%)

Table 4
χ² test of correctness rates

                  χ²     p-value  p < .05
MPI vs. XMTC-1    1.93   0.16     no
MPI vs. XMTC-2    0.91   0.34     no

For  MPI  correctness,  we  were  only  able  to  evaluate  13  of  the  16  submitted 
programs.  Of  the  remaining  three,  two  were  implemented  in  Fortran  (which 
we  could  not  evaluate  because  of  technical  reasons),  and  for  the  remaining  one 
the  subject  had  not  conformed  to  the  programming  interface  as  given  in  the 
task  description. 

We wish to investigate whether there is a difference in the probability of
implementing a correct program in MPI vs. XMTC. We use Pearson's χ²
test [20] with Yates' continuity correction to check if there is a statistically
significant difference in the frequency of correct solutions. The results of the
tests are shown in Table 4. While the differences appear large in Table 3 (86%
vs. 54%),  the results are not statistically significant, and therefore we cannot
claim that H1 is supported by the data.
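
The test statistic can be recomputed from the counts in Table 3. The sketch below is an illustration only (the analysis in the paper was done in R). It uses the standard closed form of the Yates-corrected statistic for a 2x2 table. The function names are hypothetical, and 3.841 is the usual critical value of the χ² distribution with one degree of freedom at the .05 level.

```c
/* Pearson chi-square statistic with Yates' continuity correction for a
 * 2x2 table of counts:
 *                correct  incorrect
 *     group 1       a         b
 *     group 2       c         d
 */
double yates_chi_square(double a, double b, double c, double d)
{
    double n = a + b + c + d;
    double diff = a * d - b * c;
    if (diff < 0.0) {
        diff = -diff;       /* absolute value of the cross-product difference */
    }
    diff -= n / 2.0;        /* Yates' continuity correction */
    if (diff < 0.0) {
        diff = 0.0;         /* the correction must not overshoot zero */
    }
    return n * diff * diff / ((a + b) * (c + d) * (a + c) * (b + d));
}

/* Returns 1 if the statistic exceeds the .05 critical value for one
 * degree of freedom, 0 otherwise. */
int exceeds_critical_value_05(double statistic)
{
    return statistic > 3.841;
}
```

With 7 correct and 6 incorrect MPI programs against 12 correct and 2 incorrect XMTC-1 programs, yates_chi_square(7, 6, 12, 2) evaluates to about 1.93, in line with Table 4 and well below the critical value.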


4.3 Development time

• H2: Writing XMTC programs requires less development time than writing MPI
programs.

We employ three methods for measuring development time in our analysis:
self-reported, instrumented, and combined. Self-reported measures are based
entirely on time logs, instrumented measures are based entirely on timestamps
from the instrumented compilers, and combined measures are based on
compiler timestamps when the subject is working on the instrumented machine,
and self-reported time when the subject is working off the instrumented
machine. Figure 3 shows the distribution of development time for the two classes
using our three different development time measures.

We  compute  95%  confidence  intervals  for  the  differences  in  development  time 
means  to  provide  some  notion  of  effect  size  rather  than  applying  a  t-test[13]. 
Fig. 3. Distribution of development time (effort)


Table 5
Limits of development time confidence intervals

               Lower limit  Upper limit
Reported             4.9 h       15.7 h
Instrumented         0.7 h        7.2 h
Combined             2.0 h        7.7 h

The confidence intervals for the difference in development time means are
summarized in Table 5 and depicted in Figure 4. Note that for each measure, the
confidence interval does not include 0, so we conclude there is a statistically
significant difference between treatment groups at the p < .05 level. Thus,
H2 is supported by all three development time measures.

If we consider MPI to be our reference, we can compute a reduction in mean
development time in using XMTC over MPI, which we define as

\[ R = 1 - \frac{E_{xmt}}{E_{mpi}} \]

where $E_{xmt}$ is mean time to implement the problem in XMTC, and $E_{mpi}$ is
mean time to implement the problem in MPI. Reduction in mean effort was
59% for reported time, 44% for instrumented time, and 46% for combined
time.
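
The definition can be checked with a few lines of C. This is a sketch with hypothetical function names. The group means are not listed in this section, so the example values mentioned afterwards are illustrative inputs chosen only to be roughly consistent with the reported 4.8-hour difference and 46% reduction for combined time. They are not measured data.

```c
/* Relative reduction in mean development time, with MPI as the
 * reference: R = 1 - E_xmt / E_mpi. */
double effort_reduction(double mean_xmt_hours, double mean_mpi_hours)
{
    return 1.0 - mean_xmt_hours / mean_mpi_hours;
}

/* Since R = (E_mpi - E_xmt) / E_mpi, a reported difference in means and a
 * reported reduction together imply the reference mean:
 * E_mpi = difference / R. */
double implied_reference_mean(double difference_hours, double reduction)
{
    return difference_hours / reduction;
}
```

For example, implied_reference_mean(4.8, 0.46) gives a little over 10.4 hours for the MPI mean, and effort_reduction(5.6, 10.4) returns roughly 0.46. Because both reported figures are rounded, these are approximations at best.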
Fig. 4. Confidence intervals for development time differences (Rep: reported,
Inst: instrumented, Combined)

To evaluate if subject backgrounds had an effect on the development time
score, we ran an analysis of variance based on the responses on the background
questionnaire. We asked subjects about their current major as well as their
experience in various areas, including: general software development, parallel
programming, multithreaded programming, C programming, C++
programming, and sparse matrix methods.

We employed a one-way analysis of variance (ANOVA) to check if these
variables had a statistically significant effect on the combined development time
scores. Table 6 shows the ANOVA results. The only factor that exhibited a
statistically significant effect at the p < .05 level was the parallel programming
model.


Table 6
Analysis of variance

                                  Df  Sum Sq  Mean Sq  F value  Pr(>F)
Model                              1  158.10   158.10    15.18  0.0025
Current Major                      5  126.48    25.30     2.43  0.1022
Software Development Experience    1    0.36     0.36     0.03  0.8557
Parallel Programming Experience    1    0.86     0.86     0.08  0.7795
Thread Programming Experience      1    9.44     9.44     0.91  0.3614
Sparse Matrix Experience           1    2.07     2.07     0.20  0.6643
C Development Experience           1   26.99    26.99     2.59  0.1358
C++ Development Experience         1   14.79    14.79     1.42  0.2586
Residuals                         11  114.58    10.42


4.4 A note about performance


A comparison of parallel programming models would not be complete without
some consideration of the performance of the resulting codes. In this case,
direct performance comparisons are not possible because MPI is a mature
implementation that runs on existing systems, and XMTC is a prototype
which runs only on a simulator. In addition, the two models exploit parallelism
differently. XMTC uses a spawn/join model of parallelism, where the number
of active threads varies over the lifetime of the process. In MPI, the number of
processes is fixed over the lifetime of the program. Therefore, the performance
of an MPI program can actually worsen as processes are added
if there is not enough work to distribute efficiently across the processors, so
an MPI program must be evaluated at different processor counts to determine
its peak performance.


Nevertheless, we report a limited performance comparison. We use speedup
versus a reference serial implementation (similar to "real speedup" [36]) as a
measure of performance, where speedup is defined as

\[ \text{speedup} = \frac{T_{ser}}{T_{par}} \]

where $T_{ser}$ is reference serial execution time and $T_{par}$ is parallel execution
time. Speedup allows us to make comparisons across different machines. In
the XMTC case, we had an XMTC expert code our reference serial
implementation. For the MPI case, we did not have an implementation from an expert,
so we used the fastest single-processor MPI implementation as the reference
serial implementation.
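
The speedup definition, together with the need to evaluate an MPI program at several processor counts to find its peak, can be sketched as follows. The function names are hypothetical, and the sketch makes no claim about how the measurements in this study were scripted.

```c
#include <stddef.h>

/* Speedup relative to a reference serial run: T_ser / T_par. */
double speedup(double t_serial, double t_parallel)
{
    return t_serial / t_parallel;
}

/* Index of the run with the highest speedup among n parallel timings
 * (n must be at least 1), one timing per processor count tried.
 * Because t_serial is fixed, this is the run with the smallest
 * parallel time. */
size_t peak_speedup_index(double t_serial, const double *t_parallel, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (speedup(t_serial, t_parallel[i]) >
            speedup(t_serial, t_parallel[best])) {
            best = i;
        }
    }
    return best;
}
```

Given made-up timings of 10.0, 5.0, 2.5 and 4.0 seconds at increasing processor counts and a 10.0-second serial reference, the peak is the third run. The last run illustrates how adding processors can reduce speedup.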

MPI programs were run on a 24-processor Sun SunFire system (a shared
memory machine), and XMTC programs were run on a simulator with 1024
thread-control units. Although the MPI subjects originally developed their
code for a commodity Linux cluster, we felt that it would be a fairer
comparison to measure the MPI performance on a shared memory machine, where
there is less of a performance penalty due to communication among processes.

The MPI programs were timed when multiplying a 50180 x 50180 sparse
matrix containing 1185124 non-zero elements with a dense vector containing
50180 elements. The XMTC programs were timed when multiplying a 30000 x
100 sparse matrix containing 60130 non-zero elements with a dense vector
containing 100 elements.


Fig. 5. MPI speedups (x-axis: # of processors, from 1 to 24)


Figure 5 shows the distribution of MPI speedups for a range of processor
counts, up to the limit of 24 processors. The MPI programs scale up to 16
processors, with a median speedup of 11x. However, as more processors are added the
performance worsens, and when 24 processors are used, the median speedup
is only 2.2x. (Note that some super-linear speedup occurs at 2, 4, and 8
processors, most likely due to cache effects.)


Figure  6  shows  the  distribution  of  XMTC  speedups  for  the  implementations 
submitted  by  the  subjects.  The  simulation-based  empirical  framework  for 
XMT  speedups  is  taken  from  [30].  The  results  were  obtained  through  a  cycle- 
accurate  simulator  that  was  derived  from  a  synthesizable  Verilog  description 
of  the  XMT  architecture.  The  median  speedup  for  the  subjects  was  157x,  and 
the  speedup  achievable  by  an  expert  was  206x. 


Note that for the XMTC implementations, the median speedup for the “tuned”
implementations is lower than that for the more straightforward
implementations. The outcome of a paired t-test [20] (p=.007) confirms that
there is a statistically significant difference between the two
implementations, although in the opposite direction from what was originally
expected.

Fig.  6.  XMTC  speedups


5  Threats  to  Validity 

5.1  Internal


Selection.  Since  we  were  not  able  to  randomly  assign  subjects  to  treatment 
groups,  the  outcome  of  the  study  may  have  been  affected  by  some  difference 
across  treatment  groups  other  than  the  programming  model  that  was  used. 
Unfortunately,  we  did  not  have  the  opportunity  to  administer  a  pre-test  in 
this  study. 

Selection-history  interaction.  The  subjects  received  different  amounts  of  train¬ 
ing  or  experience  in  problem  type,  programming  model,  etc.  The  professors 
endeavored  to  provide  the  subjects  with  the  same  amount  of  information  about 
the  specific  problem.  However,  the  subjects  had  different  amounts  of  experi¬ 
ence  using  the  programming  models  within  their  class.  In  the  MPI  class,  the 
subjects  had  three  previous  MPI  assignments  before  the  one  involved  in  the 
study,  whereas  in  the  XMTC  class,  the  subjects  had  only  one  previous  assign¬ 
ment. 

The assignments were very similar, but not exactly the same. The MPI subjects
were also asked to implement a conjugate-gradient algorithm in Matlab*P [12]
which calls their sparse-matrix multiply function. We did not collect
development time data on this part of the assignment. The XMTC subjects had
to implement the algorithm twice: once using the same sparse-matrix data
structures as the MPI subjects, and a second time using a different
sparse-matrix data structure. We did collect development time data on this
part of the assignment as well, so our development time measures for XMTC
may overestimate the total development time.

The  motivations  of  the  students  in  the  two  classes  may  be  different,  based  on
how  they  expected  the  assignment  to  be  graded.  In  the  MPI  assignment,  the 
students  were  told  that  50%  of  the  grade  would  be  based  on  performance.  In 
the  XMTC  assignment,  the  students  were  not  told  how  performance  would 
affect  their  grade. 

The  general  emphases  of  the  two  courses  are  also  quite  different.  The
MPI  class  focused  on  existing  high  performance  computing  architectures  and 
practical  issues  (such  as  memory  hierarchy)  that  programmers  must  deal  with 
to  achieve  good  performance.  The  XMTC  class  focused  more  on  the  theory  of 
parallel  algorithms  in  the  PRAM  model. 


5.2  Construct 


Mono-operation bias. Solving a small parallel programming problem in a
classroom environment is qualitatively different from implementing a complete
application. For example, in larger programs, more of the code will be
inherently serial, and so the effect of the parallel programming model will
not be as pronounced. On the other hand, in this problem the subjects in the
MPI group were given the domain decomposition for the problem. In a real
application, MPI programmers would have to come up with this decomposition on
their own, whereas XMTC programmers do not have to deal with this issue.

Even for small-scale problems, this single study is not representative of all
of the different types of parallelizable problems. This study focused on one
particular problem: implementing a function to do sparse-matrix dense-vector
multiply. This type of problem was easy to solve in parallel using XMTC
(possibly as easy as, or even easier than, the equivalent serial
implementation, although this was not explicitly investigated here). Other
types of problems will be easier or harder to parallelize using a
message-passing model, depending on their communication patterns (e.g.,
“embarrassingly parallel” problems such as Monte Carlo simulations, or
“nearest-neighbor” problems such as cellular automata simulations).


5.3  External 


Interaction  of  selection  bias  and  experimental  variables.  The  results  of  this 
study  only  apply  to  novice  parallel  programmers  in  MPI  and  XMTC.  These 
results  cannot  be  generalized  to  more  experienced  parallel  programmers  work¬ 
ing  outside  of  a  classroom  environment,  and  the  study  may  also  be  capturing 
learning  effects.  However,  given  that  XMTC  is  currently  a  research  language, 
there  are  no  experienced  XMTC  programmers  yet! 

Table 7
Power analysis: number of subjects required in each group

Comparison         β > .5    β > .8
MPI vs. XMTC-1       22        37
MPI vs. XMTC-2       35        63

6  Discussion 


6.1  Correctness


We  did  not  find  a  statistically  significant  difference  in  the  correctness  rates 
between  the  different  models.  However,  we  can  use  the  results  obtained  in 
this  study  to  help  plan  the  size  of  future  studies  through  the  use  of  power 
analysis  [29].  Using  the  correctness  rates  we  obtained  in  this  study  (MPI:  54%,
XMTC-1:  86%,  XMTC-2:  79%),  we  can  estimate  the  number  of  subjects  we
would  need  to  detect  an  effect  with  power  (β)  of  50%  or  80%.  (If  the  expected
effect  size  is  known  in  advance,  there  is  little  sense  in  conducting  a  study  with 
power  less  than  50%,  since  the  probability  of  a  statistically  significant  result  is 
less  than  a  random  coin  flip,  assuming  the  effect  is  real).  Table  7  summarizes 
the  number  of  subjects  required  in  each  group  to  achieve  the  desired  power. 
Note  that  we  would  need  at  least  22  subjects  in  each  group  to  achieve  a  power 
of  50%  in  comparing  MPI  to  XMTC-1  (recall  that  in  our  study  we  had  16 
subjects  in  the  MPI  group  and  14  in  the  XMTC  group). 
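
The sample sizes in Table 7 can be approximated with a standard sample-size calculation for comparing two independent proportions. The sketch below is an assumption about the method, since the text cites [29] without naming a formula: it uses the normal-approximation formula with a continuity correction, a two-sided significance level of .05, and equal group sizes. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def subjects_per_group(p1: float, p2: float, power: float, alpha: float = 0.05) -> int:
    """Approximate subjects needed per group to detect p1 vs. p2.

    Assumes a two-sided test of two independent proportions with equal
    group sizes, using the normal approximation plus a continuity correction.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    delta = abs(p1 - p2)

    # Uncorrected sample size from the normal approximation.
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    n = numerator / delta ** 2

    # Continuity correction, which matters for small samples.
    n_corrected = (n / 4) * (1 + math.sqrt(1 + 4 / (n * delta))) ** 2
    return math.ceil(n_corrected)


# Correctness rates reported in this study.
rates = {"MPI": 0.54, "XMTC-1": 0.86, "XMTC-2": 0.79}
for other in ("XMTC-1", "XMTC-2"):
    for power in (0.5, 0.8):
        n = subjects_per_group(rates["MPI"], rates[other], power)
        print(f"MPI vs. {other}, power {power:.0%}: {n} subjects per group")
```

Under these assumptions the sketch yields 22 and 37 subjects per group for MPI vs. XMTC-1, and 35 and 63 for MPI vs. XMTC-2, in line with Table 7.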


6.2  Effort 


To try to understand the differences in effort between MPI and XMTC, we
examined the source code submitted by the subjects. The MPI programs are
much larger than their XMTC counterparts (roughly 7 times larger than XMTC-1
and 2 times larger than XMTC-2 implementations). This difference in size is
due to the additional source code necessary to handle the communication
between the processors. The particular problem of sparse-matrix dense-vector
multiply requires an “all-to-all” pattern of communication among processes:
each process may potentially need data from all of the other processes to
complete the computation.

MPI supports both point-to-point (send, receive) and collective communication
(e.g., broadcast, scatter, gather, reduce, all-to-all, barrier) operations.
While any MPI program can be written using only the point-to-point functions,
use of the collective communication functions may improve performance,
depending on the architecture. There was substantial variation across
subjects in their use of MPI functions. Only three subjects used strictly
point-to-point calls; the other students used at least one collective
communication function. However, the particular collective communication
function varied from one subject to the next. Half of the students used the
vector variants of the collective communication calls (Allgatherv,
Alltoallv), which are used when the size of the data being exchanged varies
from one process to another (e.g., when the number of data elements does not
divide evenly by the number of processes). These calls are more complex than
their non-vector counterparts, and their use may account for increases in
effort.
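
Part of the added complexity of the vector variants is that the caller must supply per-process element counts and displacements rather than a single count. The sketch below computes these two arrays for a block distribution of n elements over p processes. It is plain Python with no MPI calls, and the function name and the policy of giving the first n mod p processes one extra element are illustrative assumptions, not the subjects' code.

```python
def block_counts_and_displs(n: int, p: int) -> tuple[list[int], list[int]]:
    """Split n elements over p processes as evenly as possible.

    Returns (counts, displs): counts[r] is the number of elements owned by
    rank r, and displs[r] is the offset of rank r's first element. Arrays of
    this shape are what the vector collectives (e.g. Allgatherv) take in
    place of a single count.
    """
    if p <= 0 or n < 0:
        raise ValueError("need p > 0 and n >= 0")
    base, extra = divmod(n, p)
    # Assumed policy: the first `extra` ranks each take one extra element.
    counts = [base + 1 if rank < extra else base for rank in range(p)]
    displs = [0] * p
    for rank in range(1, p):
        displs[rank] = displs[rank - 1] + counts[rank - 1]
    return counts, displs


# The 50180-element vector used for the MPI timings, over 24 processes.
counts, displs = block_counts_and_displs(50180, 24)
print(counts[:3], counts[-3:])
print(displs[:3], displs[-1] + counts[-1])
```

With 50180 elements and 24 processes, 20 ranks own 2091 elements and the remaining 4 own 2090, so a single fixed count cannot describe the exchange.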

By  contrast,  the  XMTC  code  requires  no  explicit  communication.  For  most  of 
the  submissions,  the  only  substantial  difference  between  the  XMTC-1  imple¬ 
mentation  and  the  equivalent  serial  implementation  is  the  use  of  the  XMTC 
spawn  function  to  create  one  thread  per  matrix  row  instead  of  an  outer  for 
loop. 
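
To illustrate why so little changes between the serial and XMTC-1 versions, the following Python sketch shows a serial sparse-matrix dense-vector multiply. It assumes a compressed sparse row (CSR) layout, which the text does not specify, and all names are hypothetical. Each iteration of the outer loop writes only its own output element, so the iterations are independent; this outer loop is what a one-thread-per-row spawn would replace.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr[i]:row_ptr[i + 1] indexes the non-zeros of row i in
    col_idx (column numbers) and values (matrix entries).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    # Outer loop over rows: no iteration reads or writes another row's y[i],
    # so each iteration could run as its own thread.
    for i in range(n_rows):
        total = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            total += values[k] * x[col_idx[k]]
        y[i] = total
    return y


# Small example matrix:
# [[1, 0, 2],
#  [0, 0, 3],
#  [4, 5, 0]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 2, 0, 1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

print(spmv_csr(row_ptr, col_idx, values, x))
```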

The  XMTC-2  implementations  are  larger  than  the  XMTC-1  implementations 
but  smaller  than  the  MPI  implementations.  The  extra  code  in  these  implemen¬ 
tations  is  devoted  to  dividing  up  the  work  among  a  fixed  number  of  threads. 
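
The extra bookkeeping in an XMTC-2-style solution can be sketched as follows. This Python example uses hypothetical names and a contiguous block partition chosen for illustration (the subjects' actual partitioning schemes are not described in the text). It assigns each of a fixed number of workers a range of rows and runs them on ordinary Python threads, purely to show the structure and not to model XMT performance. A CSR matrix layout is assumed.

```python
from concurrent.futures import ThreadPoolExecutor


def row_range(worker: int, n_workers: int, n_rows: int) -> range:
    """Rows handled by `worker` under a contiguous block partition."""
    start = worker * n_rows // n_workers
    stop = (worker + 1) * n_rows // n_workers
    return range(start, stop)


def spmv_fixed_workers(row_ptr, col_idx, values, x, n_workers):
    """CSR matrix-vector product with rows divided among n_workers workers."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows

    def work(worker):
        # Each worker writes only the y[i] in its own row range.
        for i in row_range(worker, n_workers, n_rows):
            y[i] = sum(values[k] * x[col_idx[k]]
                       for k in range(row_ptr[i], row_ptr[i + 1]))

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(work, range(n_workers)))
    return y


# Matrix [[1, 0, 2], [0, 0, 3], [4, 5, 0]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 2, 0, 1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spmv_fixed_workers(row_ptr, col_idx, values, [1.0, 2.0, 3.0], n_workers=2))
```

The index arithmetic in `row_range` is the kind of code that a one-thread-per-row version does not need.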


7  Conclusions  and  future  work 


We  evaluated  the  claim  that  a  PRAM-like  parallel  programming  model  (XMTC) 
requires  less  effort  than  a  message-passing  model  (MPI),  through  a  quasi¬ 
experiment  conducted  with  students  in  graduate-level  parallel  programming 
courses. 

Our  main  result  is  that  XMTC  programs  required  about  45%  less  effort  than 
MPI  programs.  There  was  insufficient  power  to  detect  a  statistically  significant 
difference  in  the  rate  of  correctness  between  the  two  models.  These  results 
suggest  that  if  architectures  continue  to  evolve  towards  fine-grained
uniform-memory  access  parallel  machines,  XMTC-like  languages  are  worth  pursuing.
However,  further  studies  are  necessary  to  evaluate  this  claim  with  respect  to 
different  types  of  problems,  as  well  as  to  larger  programs. 

While  the  sample  size  of  this  study  was  smaller  than  we  would  have  liked,  ob¬ 
taining  subjects  for  such  studies  is  difficult.  The  population  of  programmers 
who  have  training  in  parallel  programming  is  small,  so  we  rely  on  available 
parallel  programming  courses  for  novice  subjects.  Nevertheless,  we  feel  that 
this  type  of  study  is  a  good  first  step  in  the  continued  empirical  research  of  par¬ 
allel  programming  issues,  and  provides  a  basis  of  comparison  for  future  studies 
which  may  involve  more  experienced  programmers  and  different  programming 
tasks. 

In  fact,  this  study  is  one  of  a  series  of  studies  we  are  involved  in  to  explore
the  effect  of  parallel  programming  model  on  productivity.  We  are  also  in¬ 
vestigating  other  parallel  programming  models  (e.g.  OpenMP  [15],  UPC  [9], 
Matlab*P  [12]),  as  well  as  other  types  of  parallel  programming  problems.  We
are  also  conducting  case  studies  of  existing,  larger-scale  parallel  programming 
projects  to  understand  the  differences  between  phenomena  that  we  observe  in 
the  classroom  studies  and  those  that  occur  in  actual  development  projects. 
While  any  single  study  can  provide  only  a  small  amount  of  insight,
we  hope  that  by  conducting  several  studies  across  multiple  programming  mod¬ 
els,  problem  types,  and  problem  sizes,  we  can  gain  a  clearer  picture  of  how 
these  variables  affect  productivity. 

A  Raw  data


Table A.1 shows the raw effort data, in hours, using the three measures:
reported effort, instrumented effort and combined effort. Combined effort was
computed by adding the instrumented effort to the fraction of reported effort
that corresponded to work done off the instrumented machine. (When filling
out the effort log, subjects indicated whether or not they were working on
the instrumented machine.)
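
The combined-effort calculation described above can be sketched in a few lines of Python. The record layout, names, and numbers below are hypothetical and are not taken from the study's data; the sketch assumes each effort-log entry records its hours and whether the work was done on the instrumented machine.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    """One self-reported effort-log entry (hypothetical record layout)."""
    hours: float
    on_instrumented_machine: bool


def combined_effort(instrumented_hours: float, log: list[LogEntry]) -> float:
    """Instrumented effort plus reported effort done off the instrumented machine."""
    off_machine = sum(e.hours for e in log if not e.on_instrumented_machine)
    return instrumented_hours + off_machine


# Hypothetical subject: 5.5 instrumented hours and three log entries.
log = [
    LogEntry(hours=3.0, on_instrumented_machine=True),
    LogEntry(hours=1.5, on_instrumented_machine=False),
    LogEntry(hours=2.0, on_instrumented_machine=True),
]
print(combined_effort(5.5, log))
```

Only the off-machine entry is added here, since work done on the instrumented machine is already captured by the instrumented measure.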

Note  that  for  one  of  the  subjects  (subject  3  in  XMTC),  the  subject  consented 
to  participate  but  did  not  turn  in  any  reported  effort,  and  only  two  compiles 
were  logged  for  that  subject,  so  no  effort  data  exists.  Also  note  that  the  subject 
numbers  are  not  necessarily  sequential,  because  students  who  consented  to 
participate  but  did  not  turn  in  a  solution  were  not  considered  as  part  of  the 
study. 

Table A.1
Development time (effort) data